Statistical Challenges with Massive Data Sets in Particle Physics 
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The massive data sets from today's particle physics experiments present a variety of challenges 
amenable to the tools developed by the statistics community. From the real-time decision of what 
subset of data to record on permanent storage, to the reduction of millions of channels of electronics 
to a few dozen high-level variables of primary interest, to the interpretation of these high-level ob- 
servables in the context of an underlying physical theory, there are many problems that could benefit 
from a higher-bandwidth exchange of ideas between our fields. Examples of interesting problems 
from various stages in the collection and interpretation of particle physics data are provided in an at- 
tempt to whet the appetite of future collaborators with knowledge of potentially helpful techniques, 
and to encourage fruitful discussion between the particle physics and statistics communities. 
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I. INTRODUCTION 

Particle physics is the science focused on the under- 
standing of the most elementary constituents of matter 



and the forces governing their interactions. In stark con- 
trast to the minute size of the objects under investigation, 
enormous experiments are required to further this under- 
standing. A typical next generation experiment — the 
Compact Muon Solenoid (CMS) Q at the Large Hadron 
Collider (LHC), located at the European Organization 
for Nuclear Research (CERN) Q near Geneva, Switzer- 
land — will weigh 12.5 thousand tonnes and have millions 
of channels of electronics, producing nearly 40 terabytes 
of data per second. A cartoon of the CMS detector is pro- 
vided in Fig. n These data will be analyzed in real time 
and reduced to approximately ten terabytes per day that 
will be stored for later analysis. Of the one billion particle 
collisions occurring each second at the LHC, only a few 
are of interest. Finding these interesting — but possibly 
unanticipated — collisions in such a massive data stream 
represents a challenging test of forefront technology and 
computational algorithms. 
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FIG. 1: A cut away view of the CMS detector, currently under 
construction. Note the person included for scale. (Credit: the 

CMS Collaboration) 



Particle physics data come in discrete packets referred 
to as events. In the case of a proton collider experiment, 
an event is a head-on collision of two protons, each mov- 
ing at nearly the speed of light. The energy of the two 
colliding protons can allow the creation of new matter, 
by Einstein's E = mc 2 , which is spewed forth as debris 
from the collision. Particle physics detectors are designed 
to record and identify this debris. 

Sections ITT1 and Hill describe the real-time reduction of 
the massive data stream emerging from particle physics 
detectors to a handful of the most important components 
of those debris, each carrying a direction and energy. 
The total number of observables in the final state after 
this massive reconstruction is roughly two dozen. A car- 
toon characterization of these observables is presented in 

Fig. m 




FIG. 2: A cartoon of an interesting proton anti-proton col- 
lision. In this event two top quarks (t and t) are produced, 
which decay in roughly 10 -24 seconds to six particles, here 
labeled b, , v^, b, e~ , and u e . The energies and directions 
of these final state particles are (imperfectly) measured or 
inferred by the surrounding detector. (Credit: Fermiiab) 

Sections llVl and W\ describe two methods that use the 
resulting two dozen high-level observables to connect to 
the underlying physical theory. Our present understand- 
ing of that underlying model of particle physics is encap- 
sulated in the quantum mechanical "Standard Model," 
which predicts (probabilistically) the debris that will 
emerge. In practice the calculations are typically done 
using Feynman diagrams, named after the late Caltech 
professor, which serve as both a convenient calculational 
tool and an intuitive graphical aid. An example of such 
a diagram is shown in Fig. A set of Feynman rules 
associates each piece of the diagram — each vertex and 
each line — with an algebraic expression, which after 
some manipulation allows a prediction to be made for 
the fraction of collisions that will produce certain types 
of debris. 
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FIG. 3: A Feynman diagram. Time flows to the right. This 
diagram depicts a quark (q) and an anti-quark (q) colliding, 
producing two top quarks (t and i) that decay through W 
bosons (W + and W~) into six particles (6, vi, l + , b, q' , and 



The statistical analysis of these data occurs at many 
levels and in various forms. This paper sketches a few 
of the challenges currently faced in particle physics for 
which statistical methods are being employed, emphasiz- 
ing examples from pattern recognition and signal identi- 
fication. The ultimate challenge is to cull interesting and 
unanticipated events from billions of collisions in a data 
stream flowing at terabytes per second. 



II. TRIGGERING 

If the detector components that record the debris of 
each collision are thought of as the sense organs of the de- 
tector, the trigger should be thought of as a hardwired re- 
flex. Many animals are programmed to respond to quick 
motion in their peripheral vision; analogously, today's 
collider detectors are designed to respond to events with 
certain characteristics. While motion in an animal's pe- 
riphery may take hundreds of milliseconds to trigger the 
appropriate muscular response for identifying or fleeing 
from an apparent threat, the experiments at the LHC 
must decide 40 million times per second whether the col- 
lisions that have occurred are sufficiently interesting to 
be worth recording permanently. 

The large volume of data produced in today's particle 
physics detectors precludes a full analysis of all events; 
it is not even physically possible to record so much data. 
A fast analysis of the data is therefore performed in real 
time at various levels of detail, and data deemed suffi- 
ciently interesting are saved for a more thorough subse- 
quent analysis. 
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A. Trigger levels 

Many of today's particle physics experiments employ 
a trigger system with three levels: 

• Level 1. Custom electronics located physically on 
the detector perform a fast analysis within the de- 
tector components themselves, reducing the data 
that is subsequently passed through communica- 
tion networks downstream. 

• Level 2. Detector components are read out into 
computers that perform a regional analysis of the 
data. 

• Level 3. Data from different regions of the detec- 
tor are brought together in a single CPU to allow 
a final global decision. 

An animal's response to a perceived threat is deter- 
mined by a neural system that can be placed somewhere 
between "hardware" (built-in, unmodifiablc features that 
come with the biological package) and "software" (mod- 
ifiable logic learned as the organism develops). Level 

1 of many particle physics triggers make heavy use of 
"firmware," often in the form of field programmable gate 
arrays (FPGAs). These devices combine the hardware 
advantage of high operational speeds with the flexibility 
and reprogrammability of software, allowing physicists to 
change the algorithms used as experience is gained. 

The Level 2 trigger uses the longer decision time avail- 
able to it to combine information from several sub- 
detectors, and to perform a more sophisticated selection 
of the data than is possible at Level 1. In cases for which 
two bytes of information are available from a particular 
detector element, the Level 1 trigger may use only the 
coarse information available in the first byte; at Level 

2 the full information may be used. Calibration correc- 
tions that are not available in Level 1 can be applied to 
the data at Level 2. It is also possible at this stage to 
consider multivariate analysis methods. 

At Level 3, the entire event is available for consider- 
ation. At this stage a fast version of the final analysis 
can be performed, and time-intensive algorithms can be 
brought to bear. Level 3 typically is allocated « 1 second 
of decision time per event. 

B. Example 

The end cap muon trigger for the CMS experiment Q 
is an example of a practical trigger architecture. This 
trigger identifies muon particles traveling through the 
end caps of the detector £|, the signature for which is 
a charged particle traveling through the entire appara- 
tus. In the end cap region there are chambers, labeled 
CSC's in Fig.0J which can capture the ionization trail of 
the particle. 

As can be seen in Fig. [31 a muon takes a complicated 
path through the chambers, due to the varying magnetic 
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FIG. 4: A cross sectional view of a quarter of the CMS detec- 
tor. The colliding beams run along the axis defined by R — 0, 
with collisions occurring at (z, R) = (0, 0). The end cap muon 
chambers are shown, separated by slabs of steel over a meter 
thick. Muons are unique in that they can penetrate this steel. 

(Credit: the CMS Collaboration) 




FIG. 5: A sketch of the curved path a muon could take 
through the end cap muon system. 



field in the steel structure in which the chambers are em- 
bedded. The Level 1 trigger's goal is to select muons 
that follow paths through the detector that correspond 
to high momentum. Low momentum muons, which ex- 
hibit more bending as they go through the chambers, are 
of lesser interest. There is a probability distribution of 
possible bend angles and paths through the detector for 
a muon of given momentum. The trigger uses these dis- 
tributions, generated by simulation, to select events that 
are likely to contain high momentum muons. The two 
key enabling technologies for this effort have general ap- 
plicability to the real time analysis of massive streams of 
data: sorting in field programmable gate arrays, and the 
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use of memory lookups. 

Field programmable gate arrays are large scale chips, 
comprising many gates and memory cells, that can be 
configured to carry out custom algorithms. Gate ar- 
rays with a million logic units are not uncommon these 
days, and a custom CPU can be built by appropriately 
configuring the gates. In designing high-speed analyses 
of massive data streams, algorithms involving loops are 
to be avoided at all costs. Using FPGAs implementing 
massively parallel algorithms, a printed circuit board has 
been built that can sort in real time the nearly 30 giga- 
bytes of data per second incident to it as part of the CMS 
end cap muon trigger ||. Such techniques have become 
commonplace in experimental particle physics, and have 
been used in a variety of triggers. 

A second key enabling technology involves memory 
lookup devices. In the CMS end cap muon system, the 
muon's momentum between two chambers can be found 
as a function of the difference in bend angle in those 
chambers, 



FIG. 6: A scatter plot of the A0i2 and A</>23 variables for 
simulated muons of pr — 3 (red), 5 (blue), and 10 (green) 
GeV. Overlaid are contours of the function pt{A<Pi2, A023), 
obtained by maximizing the likelihood function in Eq. Q 



PT 



where Aij is determined by fitting the results of a simu- 
lation, and allowed to vary with position in the detector. 
Maximization of the likelihood 



£ = 
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(A0i2 - /X12) 2 2p(A<f>i 2 - ^i2)(A023 - M23) . (A023 - M23) 2 



CT12023 



(1) 



as a function of pt is desired, where fiij = A^/pr, <Jij 
is the error in A^ , and it is assumed that the ratios 
<jij/ 1 Hij are independent of px- The result is shown in 
Fig. The mapping between measured angle differences 
A<^>i2 and A(^3 and the most likely momentum pt is then 
loaded into a memory chip. This enables the calculation 
of a fairly complex function in the time it takes to access 
the memory, in this case about 12.5 nanoseconds ,dj. The 
computational and graphical statistics community could 
make welcome contributions to the determination of suit- 
able functions for memory lookups in similar problems. 



III. RECONSTRUCTION 

For each one-in-a-million collision selected by the trig- 
ger, the primary components of the debris spewed forth 
in the collision are reconstructed. This process is anal- 
ogous to the investigation of a head-on automobile col- 
lision, noting the resultant trajectories of broken doors, 
shattered glass, and strewn metal. The two examples 



considered here are track finding and jet clustering. 



A. Track Finding 

Many of the particles produced in an event are electri- 
cally charged. These can be detected by any of several 
technologies, including the trail of ionization left in a vol- 
ume of gas, the electron-hole pairs liberated in a semi- 
conductor, and the scintillation light emitted upon exci- 
tation of certain complex organic molecules. In all cases 
the detector is trying to measure in three dimensions the 
passage of charged particles, usually by combining three 
two-dimensional views at three different angles through 
the detector. Tracking systems are typically embedded 
in a magnetic field, which bends the track by an amount 
inversely proportional to the momentum of the particle 
perpendicular to the field, allowing a measurement of this 
momentum. 

In Fig. [7| the small dots represent the measured points 
of passage of charged particles. The superimposed lines 
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FIG. 7: End view of a collision in the roughly cylindrical CDF 
detector. Small black dots indicate points through which a 
charged particle has passed, identified in the detector. These 
"hits" are sewn together to reconstruct probable particle tra- 
jectories, shown by curves emanating from the collision at the 
center of the figure. The magnetic field lines flow perpendicu- 
lar to the page, causing the tracks to curve by an amount pro- 
portional to the strength of the magnetic field and inversely 
proportional to the momentum of the particle. (Credit: the cdf 

Collaboration) 



are fits to the tracks. There are two stages to making 
these fits: pattern recognition, in which points that rep- 
resent a common track are linked together; and fitting, 
in which track parameters, such as the momentum of the 
particle and its point of closest approach to the collision 
point, are extracted. The process is complicated by the 
presence of spurious noise hits, measurement error, and 
possible change of the particle's direction through inter- 
action with material. The determination of charged par- 
ticle trajectories can be the most time consuming part of 
the reconstruction of an event; new, more efficient meth- 
ods of track finding and fitting are continuously being 
sought. 

A typical approach to track fitting involves finding a 
set of hit points that represent a possible track, forming 
a x 2 quantifying how well the measured points fit partic- 
ular track parameters, and then minimizing this \ 2 with 
respect to those parameters. Correlations among mea- 
surements, introduced for example by the interaction of 
the particle with the material it is traversing, complicates 
and slows the inversion of the covariance matrix required 
for this minimization. The Kalman filter Q is commonly 
used in practice. 



B. Jet Clustering 

A second common reconstruction problem in particle 
physics is jet clustering in calorimeters, detectors that 
measure the energy of particles entering them. An en- 
ergetic, strongly-interacting particle entering a calorime- 
ter causes a chain reaction of nuclear breakup and par- 
ticle production, resulting in a shower of particles pass- 
ing through the detector. Figure [S] shows a calorimeter 
unfolded out into a two-dimensional array; the height 
of each cell is proportional to the amount of energy de- 
posited in it in this particular event. Quarks produced in 
the original collision fragment into many particles, which 
appear in the detector as a collimated "jet" of energy 
flow. The pattern recognition problem involves identify- 
ing the clumps of energy in the calorimeter that consti- 
tute a jet, allowing an estimate of the energy and direc- 
tion of the parent quark. Lcss-than-perfect theoretical 
understanding of the fragmentation process complicates 
this already messy problem. 



Run 142344 Event 1 669603 Thu Feb 28 01 42:39 2002 
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FIG. 8: An example of jet clustering. A rolled out view of 
the detector is shown. The bin heights correspond to energy 
deposited in cells of the calorimeter. Clusters of energy in 
the detector are associated with jets, shown by the circles 

CndOSing them. (Credit: the D0 Collaboration) 



There are other sources of jets that one would like to 
identify. For example, r leptons can decay into a tightly 
collimated jet of energy in the calorimeter. While the sta- 
tistical distribution of energy from a r decay is different 
on average from that obtained from the fragmentation of 
quarks, there is a significant overlap in the distributions. 
Optimal use of measured information to best separate 
jets from quarks and jets from t leptons is an outstand- 
ing problem in proton collider experiments. 
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IV. KERNEL DENSITY ESTIMATION 

Section [H] discussed how potentially interesting events 
are triggered on in real time; Sec. IIIII sketched how the 
millions of channels of detector information characteriz- 
ing each event are reduced to roughly two dozen numbers 
describing the trajectories of the primary components of 
the debris from each collision. This section and the next 
summarize two methods designed to facilitate the inter- 
pretation of these events in the context of an underlying 
physical theory. 

The problem of interpretation can often be formulated 
as a classification or discrimination problem, with the 
goal being to distinguish between two different hypothe- 
ses. Our understanding of the underlying physics is such 
that one of the two hypotheses is invariably the Standard 
Model, and the second is often an extension of the Stan- 
dard Model. This leads naturally into thinking of the 
problem as searching for a "signal" over a "background." 
Here and below the signal will be denoted by S, and the 
background will be denoted by B. The two hypotheses 
being compared are then schematically A — S + B versus 
B. Particle physicists often perform this comparison in a 
rather simplistic way, carving out one contiguous region 
in a low-dimensional variable space, counting the number 
of events seen in the data within that region, and com- 
paring to the number of events predicted from the two 
hypotheses A and B under consideration. 

The modeling of the response of our detector to this 
debris is done through a rather detailed Monte Carlo sim- 
ulation of each detector subcomponent. In the simulation 
the debris is tracked through detector components, caus- 
ing signals on individual channels in the same way that 
real debris produces such signals in our physical detec- 
tors. The simulated events are then processed using the 
same algorithm as was used on the data. The simulation 
of these events in the detector often requires substantial 
time, on the order of several seconds per event on a mod- 
ern (at the time of this writing) 2 GHz CPU. This limits 
the number of events available to model the predictions 
from different hypotheses of the underlying physics. In 
nearly all cases, time and other practical constraints limit 
the number of simulated events with which we can pop- 
ulate the space of two dozen observables to on the order 
of one million. 



A. The problem 

The issue described in this section could be avoided if 
we had an analytic form for the theoretical prediction in 
our space of observables. The Monte Carlo nature of our 
simulation prevents this; some prescription for smoothing 
out the density of points obtained from this Monte Carlo 
approach is therefore required. Even one million events 
do not adequately populate a ten-dimensional space, so 
physicists are forced to consider a subset of relevant vari- 
ables for testing hypotheses of interest. The nature of 
each hypothesis to be tested typically immediately sug- 
gests a variable space V of low dimensionality that is apt 
to be most useful for distinguishing the hypothesis from 
the Standard Model, which has been the default hypoth- 
esis in the field for the past two decades. The dimension- 
ality of V is typically between one and four, each dimen- 
sion representing a simple algebraic combination of the 
original two dozen observables. In practice, the choice of 
which variables to use to test a given hypothesis is often 
a matter of much consternation and discussion. 



B. Kernel solution 

There are a number of ways to determine the region 
in which to do this counting. One such procedure, still 
used to a surprising extent, is to eyeball the predicted 
distributions for the two hypotheses and choose some 
rectangular box within the variable space by consider- 
ing one-dimensional projections. A more sophisticated 
option, and one enjoying increasing popularity among 
particle physicists, is to use a neural network trained to 
distinguish between the two different hypotheses. 

A third option, and the one we advocate, is to use 
kernels to estimate probability densities for the two hy- 
potheses within the variable space V. The simplest es- 
timate gives the probability densities p(x\S) and p(x\B) 
at a point x within V in terms of the Monte Carlo points 
xsi and XBj in V, each with weight wsi and WBj, drawn 
from the probabilistic predictions of S and B, as 



and 

pfflB) = ^—Y WBj P (-(*-*B i ) T *?V-*B j )/2hB M ) (3) 
Ej w bj ^ V(27r) d \Zji\hB 

I 



Here xsi denotes the position within the d-dimcnsional variable space V of the i Monte Carlo event used to 
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model hypothesis S, and XBj denotes the position of the 
j th Monte Carlo event used to model hypothesis B. The 
width of the Gaussian kernels are set by the covariancc 
matrices Eg and E^ of the points {xsi} and {xsj}, and 
by smoothing parameters hs and h,B that depend upon 
the number of Monte Carlo events available and the di- 
mensionality of V. 

The two densities p(x\S) and p(x\B) are then used to 
define a discriminant 



D(x) 



p(x\S) 



p(x\S)+p(x\B) 



(4) 



The discriminant D(x) is a function of position within the 
variable space V; at each point within that space D{x) 
is a number between zero and unity, taking on values 
close to zero in regions that are predominantly populated 
by the types of collisions predicted by the hypothesis B, 
and taking on values that are close to unity in regions 
of V that are predominantly populated by the types of 
collisions predicted by the hypothesis S. 

The set of all points x within V for which D(x) > D cut 
thus defines a region in the variable space V in which the 
contribution of collisions from the signal hypothesis S is 
enhanced relative to the Standard Model background B. 
Here D cut is some number between zero and one, chosen 
to optimize a figure of merit. A common choice for this 
figure of merit is the number of events predicted by S 
in the region divided by the square root of the number 
of events predicted by B in the region, although more 
sophisticated choices for this figure of merit have been 
considered. 

The number of events observed in the data within the 
region of V thus chosen is compared to the number of 
events predicted within that region by the "background" 
hypothesis B and the "signal + background" hypothesis 
S + B, and from this comparison constraints are set on 
parameters internal to the hypothesis S, in some cases 
ruling out S entirely. 

One problem shared among all of these methods is that 
they are not sufficiently prescriptive. The simple selec- 
tion of a rectangular box in a multivariate space typically 
takes some fiddling, since a completely prescriptive pro- 
cedure for choosing this box is lacking. The parameters 
of a neural network typically have to be manipulated to 
achieve reasonable performance. In the use of kernels, 
the choice of the width of each kernel and its covariancc 
also lacks a sufficiently robust and general prescription, 
despite many proposals existing in the literature. Ob- 
taining a prescription that works reasonably and without 
pathologies in all cases would be incredibly helpful. 



C. Example: Quaero 

An attempt at making the procedure sketched above 
completely prescriptive has been implemented in an algo- 
rithm called Quaero || . Particle physics experimenters 
are notoriously guarded with their data, rarely sharing 



data in "raw" form with physicists outside their large 
collaborations. The rationale for this is that the data in 
raw form are complicated by detailed effects and foibles 
of the detectors. In order to perform a meaningful analy- 
sis of the data, the detector must be understood in much 
greater detail than anyone outside of the collaboration 
can reasonably hope to achieve. 

This presents an obstacle to progress in particle 
physics, since the normal steps of testing any particu- 
lar hypothesis within the collaboration (and then obtain- 
ing official collaboration approval of the result) typically 
takes greater than a year. The solution to this conun- 
drum is to package the knowledge of the physicists within 
the collaboration into a tool that can perform an analysis 
automatically — a tool that "knows" how to correct for 
the foibles of the experimental apparatus. 



Quaero 

A General Interface to 1)0 Data 



PKL Manual 
Examples 



Final State : |e mu (met) (nj)j^J Smear? 17 



G Pythia Input : 
C Signal File : f 



Browse... | xsec : ( pb 



C View C Search 



Backgrounds : 17 (nj) 17 Wfait 17 Z(n\) !7 W(n\) 17 tt(n\) 
Constraints : \~ 



Variables: 



vl| 
v2| 
v3| 



J 



J 



Requestor 



Name: 



Institution: T 
Email: | 



Brief description of model: 



Help! - Bug history - D0 - Fermjlab - Author 



FIG. 9: The Quaero interface to D0 data. A physicist out- 
side the collaboration is able to provide his hypothesis in the 
form of the types of events he would expect should be observed 
were his hypothesis correct. Quaero performs the complete 
analysis, correctly accounting for expert knowledge of detec- 
tor imperfections built into the algorithm, and returns to the 
requesting physicist constraints imposed by the data on the 

prOpOSed hypOthesiS. (Credit: tho D0 Collaboration) 

Figure shows the Quaero web page, through which 
the DO collaboration has made a subset of its data pub- 
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licly available. A physicist can provide a file describ- 
ing the type of events his model predicts should be seen 
in the detector, together with at most three variables 
that can be used to distinguish his signal (S) from the 
Standard Model (B). Quaero then constructs kernel 
density estimates p{x\S) and p(x\B) of the signal and 
background in the variables that the user has provided, 
uses these to construct the discriminant D(x) according 
to Eq. 21 chooses a number D cu t to maximize a reason- 
able figure of merit, and from this determines the region 
of the variable space V for which D{x) > D cut . The 
number of events observed within that region in the data 
are then counted and compared to the number of events 
predicted by the physicist's signal and to the number of 
events predicted by the Standard Model alone. From this 
comparison, parameters internal to the physicist's model 
are constrained. Examples of the densities formed in this 
procedure are shown in Fig. 1101 

Background density Signal density Selected region 
(a) (b) (c) 




FIG. 10: The background density (a), signal density (b), and 
selected region (shaded) (c) determined by Quaero in several 
examples. The dots in the plots in the rightmost column 
represent events observed in the data. (Credit: the do collaboration) 

Quaero uses kernels whose shape is determined from 
the computation of a local covariance matrix, which ap- 



pears to provide reasonable performance in most of the 
analyses that have been attempted. It should be noted, 
however, that the reliance upon the "correctness" of the 
density estimate is in some sense minimal — the density 
estimate is used only within a prescription to determine 
a particular region within the variable space V that en- 
hances the contribution of signal relative to background. 
Using the kernel density estimate directly to compute a 
likelihood requires confidence that the density estimate 
prescription used will yield non-pathologic results in all 
cases, and a prescription generating sufficient confidence 
is currently lacking. 

By automating the testing of individual hypotheses, 
Quaero dramatically decreases the time required to test 
models against large sets of data. Each such test, for- 
merly the subject of a Ph.D. theses, can now be done in 
an hour. 



V. SLEUTH 

Because Quaero is designed to facilitate the testing of 
specific hypotheses against a large set of data, it requires 
that specific hypotheses be defined. A separate algorithm 
is required to allow searches in the data for signatures 
of a more general type. Sleuth looks for anomalous 
deviations in the data that could indicate the presence 
of interesting physics, allowing the simultaneous tes ting 
of a large number of vaguely-specified hypotheses M, [l£J, 

mm 



A. The problem 

The Standard Model of particle physics has so far 
passed the experimental tests to which it has been sub- 
jected. Nonetheless, there is a hole in the Standard 
Model that indicates there are likely to be new funda- 
mental discoveries at energy scales that our accelerators 
are just beginning to probe. Exactly what form that 
new physics might take is currently unknown, despite the 
work of hundreds of people over the last two decades. 

Possible findings include 

• particles with bare magnetic charge ("magnetic 
monopoles" ) ; 

• a new symmetry of Nature ( "supersymmetry" ) ; 

• a new strong force ( "technicolor" ) ; 

• the presence of a new weak force ("heavy gauge 
bosons"); 

• the existence of large extra spatial dimensions, 
curled up on scales smaller than 1 mm ("extra di- 
mensions" ) ; 

• electrons in excited states ("excited fermions"); 
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• laws of nature that arc not isotropic ("non- 
commutative theories"); 

• a heavier analog of the electron ( "fourth generation 
of fermions" ) ; and 

• evidence pointing to a unification of the strong, 
weak, and electromagnetic forces ("grand unified 
theories" ) . 

Each of these possibilities represents a class of theories 
with a number of adjustable parameters. When taken to- 
gether, the model space is sufficiently large that system- 
atically checking all possibilities is not a viable option. 
We know only vaguely what it is we should be searching 
for; equivalently, we are searching for more things than 
can possibly be tested at one time. 

Humans are notoriously good at finding patterns, par- 
ticularly when dealing with small numbers of events — 
the history of particle physics is strewn with cases of 
patterns being mistakenly discerned. An algorithm is re- 
quired that will enable a general search for all possibilities 
simultaneously, rigorously taking into account the "trials 
factor" (a measure of the number of different possibilities 
considered) when reporting a final number. 

B. Solution 

The solution of this problem in the context of particle 
physics at accelerators that collide protons or their an- 
tiparticles is an algorithm called Sleuth. Consideration 
of the many possibilities just mentioned naturally leads 
one to ask whether there is any common feature among 
them. If such a common feature exists, perhaps it can be 
searched for in a general way. 

It turns out that such a commonality does in fact exist, 
justifying the following three assumptions: 

• the data can be partitioned in such a way that a 
signal will predominantly appear within a single 
category; 

• interactions signaling the presence of new physics 
will generally produce objects with larger energy 
than expected from background processes; and 

• a signal is apt to appear as an excess of events — 
i.e., more events observed in the data than expected 
from background. 

Sleuth begins by taking all of the data collected in 
the experiment and partitioning it into categories. This 
partitioning is orthogonal; each event ends up in one and 
only one of these categories. The partitioning is chosen 
such that if new physics appears in the data, it is likely to 
end up predominantly in a single category. Each category 
contains a set of "similar" events, in the sense that the 
events in each category contain the same types of debris 
from the collision. 



Within each category, d variables (where d ranges be- 
tween one and four, depending upon the category) are 
chosen. In the case of Sleuth, these variables corre- 
spond roughly to the energies of the objects produced 
in the collision. The variables to be used for each cate- 
gory are set by a rule agreed upon before the data are 
analyzed. 

An arbitrary number of regions can be drawn in the 
variable space just defined. In order to discretize the 
regions that can be considered, the following procedure 
is adopted. A multivariate change of variables is made, 
such that in the new variables the prediction of the de- 
fault hypothesis (the Standard Model) is a uniform dis- 
tribution in [0, l] d . We refer to [0, l] d for arbitrary d as 
the "unit box," meaning unit interval when d = 1, unit 
square when d = 2, unit cube when d = 3, and so forth. 
Regions are then defined about sets of data points using 
the concept of Voronoi diagrams, an example of which 
is shown in Fig. 1111 The region around any single data 
point is the space within the unit box closer to that data 
point than to any other data point in the unit box. The 
number of regions — all possible unions of these individ- 
ual regions — that can be considered has been reduced by 
this process from uncountably infinite to a mere 2 JVdata , 
where A<j a ta is the number of events observed in the data 
in this category. Because not all such regions are phys- 
ically interesting, the regions considered are restricted 
to those containing the "upper right-hand corner" (l d ) 
of the unit box and satisfying other criteria favoring re- 
gions in which all variables take on values close to 1 . New 
physics is expected to appear in collisions with energetic 
particles in the final state; our mapping into the unit box 
is done such that these events will take on values close to 
the upper right hand corner. 

The "interestingness" of each region is quantified as 
the Poisson probability that the background within that 
region will fluctuate up to or beyond the observed num- 
ber of events. A search heuristic is used to find the most 
interesting region 1Z in each category in the data. The 
fraction V of hypothetical similar experiments, in which 
a set of "pseudo" data points are drawn from the back- 
ground distribution, that give rise to a result more inter- 
esting than the most interesting region observed in the 
data is determined. V is a number between zero and 
unity; V will be a (uniformly-distributed) random num- 
ber between zero and unity if the data are composed of 
background only; V should be close to zero if there re- 
ally is something interesting in the data. One value of 
V is determined for each category. The minimal V in 
the (roughly 100) categories considered determines V, 
the fraction of hypothetical similar experiments in which 
a more interesting region would be found in any of the 
categories considered. This V is the final measure of 
"interestingness." If the data are distributed according 
to background only, V will be a number randomly dis- 
tributed in the unit interval. If the data contain a hint 
of a signal, V will (hopefully) be small. The output of 
Sleuth is thus twofold: the most interesting region 1Z 
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FIG. 11: The seven black circles in each panel are data points 
within the unit square. The Voronoi diagram (a) is formed 
by drawing perpendicular bisectors of imaginary line segments 
connecting neighboring pairs of points; the resulting lines par- 
tition the unit square into regions with the property that each 
space point inside the region is closer to the single data point 
inside that region than to any other data point in the square. 
Regions considered by Sleuth are unions of these individual 
regions, such as the shaded region in (b). Criteria are imposed 
upon the regions that Sleuth is allowed to consider, includ- 
ing the criterion that the region include the "upper right-hand 
corner" (1,1) of the unit square, as shown in (b). (Credit: the 



D0 Collaborate 



in any of the categories considered, and a quantitative 
measure V of how interesting that region actually is. 



C. Results 

By construction, Sleuth will not find anything if there 
is nothing there to be found. But would Sleuth find 
something if there were something there to be found? 
This is impossible to address adequately in the general 
case; the answer depends to what degree the underly- 
ing physical assumptions made in Sleuth are realized in 
Nature. The question can, however, be definitively an- 
swered in any specific scenario, and several studies have 
been performed in order to test Sleuth's sensitivity to a 
variety of new physics. The results of one such example 
are shown in Fig.^J Other studies in various categories 
with different signals indicate that Sleuth performs as 
anticipated, the success of the algorithm in each case de- 
pending upon the strength of the signal and the extent 
to which the signal satisfies the three assumptions on 
which the algorithm is based. Such studies indicate that 
Sleuth has a reasonable shot of discovering new physics 
were it present in our data. 

Sleuth has been applied to a range of particle collider 
data collected between 1992-1996 by the D0 experiment 
at the Fermilab Tevatron. Disappointingly, a null result 
has been obtained. The Vs obtained in thirty-two differ- 
ent categories are histogrammed in Fig. I14f b) , and the 
conversion to V given the number of categories consid- 
ered is shown in Fig. I14f a). 
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FIG. 12: Examples of Sleuth's performance on a number of 
mock experiments for a particular signal (denoted tt). This 
particular signal manifests itself in four categories (here la- 
beled Wjjj, Wjjjj, Wjjjjj, and Wjjjjjj). The number of 
"background" (Bkg) events predicted by the Standard Model 
and by the tt signal is shown in the legend. The horizon- 
tal axis is the minimum V found in the four categories, in 
units of standard deviation; the vertical axis is the fraction of 
mock experiments in which this value of V is obtained. The 
dashed line is the minimum V found when the data in the 
mock experiments are pulled from the background distribu- 
tion; the solid line is the minimum V found when the data in 
the mock experiments are pulled from the background with 
an admixture of the tt signal. (Credit: the do collaboration) 



D. Remarks 



The general problem of looking for a signal that is 
only vaguely specified is of course not peculiar to parti- 
cle physics, and the general thread of the solution should 
therefore be applicable to any problem of similar type. 
The individual steps of the algorithm — partitioning 
the data into categories, defining a small set of variables 
for each category, rigorously defining regions within each 
variable space, specifying criteria for the regions to be 
considered, quantifying the intercstingncss of any partic- 
ular region, searching the data to find the most interest- 
ing region within each category, conducting hypothetical 
similar experiments to determine V for each category, 
and conducting a second set of hypothetical similar ex- 
periments to determine V for the data as a whole - 
are generally modifiable to many problems. Where the 
size of the data set is an issue, a specific solution can 
often be found that scales with the number of events n 
as O(nlogn). 
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FIG. 13: Examples of Sleuth's evaluation of data in two 
categories (denoted W 2j and Z2j). The open circles are 
data points outside the region selected by Sleuth; the filled 
circles are data points inside the region selected by Sleuth. 
The value of V determined in each case is shown upper right. 

(Credit: the D0 Collaboration) 
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FIG. 14: For each of the categories considered at D0, one 
value of V was determined. These values, in units of standard 
deviation, are histogrammed as filled circles in (b); the solid 
line shows the distribution expected if the data consist of 
background only. The minimum V found in all categories 
considered must be mapped to the final output V by taking 
into account the many categories that have been considered. 
The horizontal axis in (a) represents the minimum V found, 
in units of standard deviation; the vertical axis represents the 
final output V, also in units of standard deviation; the blue 
curve shows the relationship between the minimum V found 
and the corresponding final output V . At D0, the minimum 
V was found to be 0.04; from (a), this is seen to correspond 

tO T* = 0.89. (Credit: the D0 Collaboration) 



VI. SUMMARY AND DISCUSSION 

The massive data sets from this decade's particle 
physics experiments present a variety of challenges. The 
selection of which collisions to record on permanent stor- 
age (triggering), discussed in Sec. [HI presents a need 
for smart algorithms whose implementation in a paral- 
lel hardware architecture can take the high rate of inci- 
dent data. Understanding and optimizing the necessary 
trade-offs imposed by the constraints of each system in- 
volve design patterns that are still being learned. Impos- 
ing deadlines and ignorance of theoretical best practice 
result in systems that can be noticeably suboptimal. 



The reconstruction of the trajectories of the primary 
components of the debris emerging from the collision, 
sketched in Sec. IIIII involves a number of pattern recog- 
nition problems that can be handled at the more leisurely 
pace of one second per collision. Track finding, a connect- 
the-dots problem of identifying curved particle trajecto- 
ries in the presence of noise and complicating interactions 
of the particle with the traversed material, is prototypi- 
cal problem. Tracking flying objects with sweeping radar 
or swimming objects emitting infrequent signals are two 
of many examples where similar issues are encountered. 
Jet clustering, the clustering of the debris observed in the 
detector, has similarities to many of the clustering prob- 
lems currently treated in the statistics literature. These 
are two areas in which the statistical community's impact 
on current practice in our field has been significantly less 
than it could be. 

Kernel Density Estimation has been recognized by the 
field as a useful technique for estimating parent densities 
of finite samples, the limited statistics of which are often 
imposed by the time required to simulate what would be 
observed in our detectors if a particular model were re- 
alized in nature. Quaero, the intelligent web interface 
to D0 proton anti-proton collision data described in Sec- 
tion II VI serves as an existence proof that the testing of 
particular hypotheses against high energy collider data 
can be formalized and automated. Outstanding issues 
here include whether a completely prescriptive kernel 
density algorithm can be achieved that avoids pathologies 
when presented with outliers, delta functions in an other- 
wise continuous distribution, and discontinuous bound- 
aries. To be a bit more provocative: For every exist- 
ing kernel estimation prescription of which we arc aware, 
a sample exists for which the prescription produces a 
clearly undesirable estimate of the parent density. Pre- 
sumably a prescription that is never obviously unreason- 
able can be achieved, and this would be of great value. 
Such a prescription whose time cost grows only linearly 
with the size of the sample (even in the multidimensional 
case) would be of even greater value. 

The problem of how to look for a vaguely specified 
feature in a large set of data, discussed in Section is 
omnipresent in scientific research. Having the ability to 
rigorously quantify the "intcrcstingncss" of a particular 
subset of collisions in an unbiased manner is found to be 
crucial. This basic lesson can be widely generalized. The 
solution (Sleuth) is obtained by constructing a rigor- 
ous quantification of "interestingness" before the data is 
analyzed, so that pseudo experiments can be simulated 
to determine what fraction of them would yield data as 
interesting as what is actually observed. Obvious ap- 
plications range from identifying statistically significant 
galaxy clusters to identifying statistically significant dis- 
ease clusters. 

Opportunities for the computational and graphical 
statistics community to contribute to the solutions of 
problems faced in particle physics abound; this article 
has barely scratched the surface. We look forward to a 
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productive exchange of ideas over the coming decade. 
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